## The Processor: Single-Cycle Implementation



Caiwen Ding
Department of Computer Science and Engineering
University of Connecticut

CSE3666: Introduction to Computer Architecture

#### **Outline**

We will first implement a simple single-cycle RISC-V processor. We will then improve it to a more realistic pipelined version (next lecture).

- Overview, after review
- Datapath
- Control
- Performance


Reading: Chapter 4.1 - 4.4

## Review of building blocks







## **Sequential elements**



#### A Subset of RISC-V Instructions

Simple subset that shows most aspects

- Arithmetic/logical: add, sub, and, or

- Memory reference: lw, sw

- Branch: beq

| Type    | Instruction | Opcode   | Funct3 | Funct7   |
|---------|-------------|----------|--------|----------|
| R-Type  | add         | 011 0011 | 000    | 000 0000 |
| R-Type  | sub         | 011 0011 | 000    | 010 0000 |
| R-Type  | and         | 011 0011 | 111    | 000 0000 |
| R-Type  | or          | 011 0011 | 110    | 000 0000 |
| I-Type  | lw          | 000 0011 |        |          |
| S-Type  | sw          | 010 0011 |        |          |
| SB-Type | beq         | 110 0011 |        |          |

## **Instruction encoding**

#### Fields are at the same location in all encoding formats

opcode, funct3, rs1, rs2, rd

| Name (bit position) | 31:25                        | 24:20 | 19:15 | 14:12  | 11:7          | 6:0    |
|---------------------|------------------------------|-------|-------|--------|---------------|--------|
| (a) R-type          | funct7                       | rs2   | rs1   | funct3 | rd            | opcode |
| (b) I-type          | immediate[11:0] (bits 31:20) |       | rs1   | funct3 | rd            | opcode |
| (c) S-type          | immed[11:5]                  | rs2   | rs1   | funct3 | immed[4:0]    | opcode |
| (d) SB-type         | immed[12,10:5]               | rs2   | rs1   | funct3 | immed[4:1,11] | opcode |
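Because the fields sit at fixed bit positions, they can be extracted with shifts and masks before the instruction type is even known. The following is a minimal sketch; the function name `decode_fields` and the sample encoding of `add x5, x6, x7` are illustrative and not part of the slides.

```python
def decode_fields(instr):
    """Split a 32-bit RISC-V instruction word into its fixed-position fields."""
    return {
        "opcode": instr & 0x7F,          # bits 6:0
        "rd":     (instr >> 7) & 0x1F,   # bits 11:7
        "funct3": (instr >> 12) & 0x7,   # bits 14:12
        "rs1":    (instr >> 15) & 0x1F,  # bits 19:15
        "rs2":    (instr >> 20) & 0x1F,  # bits 24:20
        "funct7": (instr >> 25) & 0x7F,  # bits 31:25
    }

# Hand-assembled example: add x5, x6, x7
fields = decode_fields(0x007302B3)
print(fields)
```

For I-, S-, and SB-type instructions some of these fields hold immediate bits instead, but the hardware can still read them at the same positions and simply ignore the ones it does not need.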

How are the instructions executed?

#### **Execution of instructions**

What are the steps to execute instructions?

```
add rd, rs1, rs2 # sub/and/or
lw rd, offset(rs1)
sw rs2, offset(rs1)
beq rs1, rs2, offset
```

- How does the processor get the instruction?
- How does the processor get operands?
- How does the processor generate result, using what modules?
- How does the processor save the result?

## **Steps in Instruction Execution**

| Hardware       | R-type          | Load/Store      | Branch    |
|----------------|-----------------|-----------------|-----------|
| PC, I-Mem      |                 |                 |           |
| RF and control |                 |                 |           |
| ALU            | Compute result  | Compute address | Compare   |
| Data Memory    |                 | read/write      | Update PC |
| RF             | Write           | Write (load)    |           |

#### **CPU Overview**



## But we cannot just join wires together



#### **Control and MUX**



### **Building a Datapath**

- Let us build a RISC-V datapath incrementally
  - Refining the overview design
- Datapath: Elements that process data and addresses in the CPU
  - Registers, ALUs, MUXes, memories, ...

Pay attention to details!

#### **Instruction Fetch**



If we remove the instruction memory, have we seen a similar circuit before?

## **R-Type Format Instructions**

- Read two register operands
- Perform arithmetic/logical operation
- Write register result




#### **Load/Store Instructions**

- Read register operands
  - Are one or two registers needed?
- Use ALU to calculate address using sign-extended 12-bit offset
- Load: Read memory and update register
- Store: Write register value to memory



a. Data memory unit

b. Immediate generation unit
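The immediate generation unit gathers the immediate bits from their positions in the instruction and sign-extends them. The sketch below covers the three formats in our subset. The names `sign_extend` and `imm_gen` are illustrative, and the sample encodings were assembled by hand. For SB-type it returns the byte offset (bit 0 is always 0); a hardware design may instead produce this with a separate shift-left-by-1 stage.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)

def imm_gen(instr):
    """Return the sign-extended immediate for lw, sw, and beq."""
    opcode = instr & 0x7F
    if opcode == 0b0000011:                      # I-type (lw): imm[11:0] in bits 31:20
        return sign_extend(instr >> 20, 12)
    if opcode == 0b0100011:                      # S-type (sw): imm[11:5] | imm[4:0]
        imm = (((instr >> 25) & 0x7F) << 5) | ((instr >> 7) & 0x1F)
        return sign_extend(imm, 12)
    if opcode == 0b1100011:                      # SB-type (beq): imm[12|10:5] and imm[4:1|11]
        imm = ((((instr >> 31) & 0x1) << 12) | (((instr >> 7) & 0x1) << 11)
               | (((instr >> 25) & 0x3F) << 5) | (((instr >> 8) & 0xF) << 1))
        return sign_extend(imm, 13)
    raise ValueError("instruction has no immediate in this subset")

print(imm_gen(0x010A2503))  # lw  x10, 16(x20)
print(imm_gen(0x00AA2823))  # sw  x10, 16(x20)
print(imm_gen(0xFE000EE3))  # beq x0, x0, -4
```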

## R-Type/Load/Store Datapath



#### **Branch Instructions**

- Read two register operands
- Compare them with ALU
  - Do subtraction and check Zero
- Calculate target address: PC + immediate
  - Additional adder (because ALU is used for comparison)

#### **Branch Instructions**



Figure 4.9

## Full Datapath (w/o control)



Figure 4.11


## **Datapath Summary**

- First-cut datapath can execute an instruction in one clock cycle
- Each datapath element, e.g., RF and ALU, can only do one function at a time
  - They cannot be used twice
  - Hence, we need separate instruction and data memories
- Use multiplexors where alternate data sources are used for different instructions

We still need control signals: select to MUXes, ALU operations, RF write, Memory read, Memory write

## **Datapath with control (preview)**




#### **ALU and its functions**

- ALU performs functions specified by 4-bit ALU operation
- Design a combinational circuit to generate ALU operation
  - We call the module ALU Control

| ALU operation | Function |
|---------------|----------|
| 0000          | and      |
| 0001          | or       |
| 0010          | add      |
| 0110          | subtract |
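As a rough behavioral model of the table above, the ALU can be sketched as a function that takes the 4-bit ALU operation and two operands and returns the result together with the Zero flag. The 32-bit width and the name `alu` are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit datapath

def alu(operation, a, b):
    """Return (result, zero) for the four ALU operations in the table."""
    if operation == 0b0000:
        result = a & b
    elif operation == 0b0001:
        result = a | b
    elif operation == 0b0010:
        result = (a + b) & MASK32
    elif operation == 0b0110:
        result = (a - b) & MASK32
    else:
        raise ValueError("unsupported ALU operation")
    return result, result == 0

print(alu(0b0010, 7, 5))   # add
print(alu(0b0110, 5, 5))   # subtract; Zero is set when the operands are equal
```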




This is the table we had after the ALU design. Whoever designs the ALU provides the table.

## **Design ALU Control**

- For each instruction,
  - What operations do we want ALU to perform?
  - How do we identify it from fields/bits in the machine code?

| Instruction | Operation to perform on ALU | Fields in machine code |
|-------------|-----------------------------|------------------------|
| add         | add                         | opcode, funct3, funct7 |
| and         | and                         | opcode, funct3, funct7 |
| or          | or                          | opcode, funct3, funct7 |
| lw          | add (base + offset)         | opcode                 |
| sw          | add (base + offset)         | opcode                 |
| beq         | subtract (check Zero)       | opcode                 |

### **Checking opcode**

- Opcode is checked in the main control, which generates a 2-bit signal ALUOp to indicate the instruction type
- ALU control uses 2-bit ALUOp, instead of opcode directly
  - Has enough information to generate ALU operation

| Instruction | ALUOp |
|-------------|-------|
| lw          | 00    |
| sw          | 00    |
| beq         | 01    |
| R-type      | 10    |
## **ALU Control Input and Output**

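- If ALUOp is 10 (R-type), check more bits in funct3 and funct7

If we do not check funct7, which operation cannot be done?

| Instruction opcode | ALUOp | Operation       | funct7   | funct3 | ALU function | ALU operation |
|--------------------|-------|-----------------|----------|--------|--------------|---------------|
| lw                 | 00    | load word       | XXX XXXX | XXX    | add          | 0010          |
| sw                 | 00    | store word      | XXX XXXX | XXX    | add          | 0010          |
| beq                | 01    | branch if equal | XXX XXXX | XXX    | subtract     | 0110          |
| R-type             | 10    | add             | 000 0000 | 000    | add          | 0010          |
| R-type             | 10    | subtract        | 010 0000 | 000    | subtract     | 0110          |
| R-type             | 10    | and             | 000 0000 | 111    | and          | 0000          |
| R-type             | 10    | or              | 000 0000 | 110    | or           | 0001          |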

#### Implementation of ALU Control (Hardware)

We only need four bits from funct fields (bits 30, 14, 13, and 12)



Write the logic expression for each bit in Operation

For each bit, write a product term for each row where the operation bit is 1, and then OR the product terms together. Then simplify the expression.

#### Implementation of ALU Control (Software)

- In software (HDL or simulation), we can describe the behavior directly

For example, in MyHDL,

```
if ALUOp == 0b00:                   # lw/sw: add to compute the address
    ALUOperation.next = 0b0010
elif ALUOp == 0b01:                 # beq: subtract to compare
    ALUOperation.next = 0b0110
# more cases (R-type: decode funct7 and funct3)
```
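The fragment above leaves the R-type cases open. A plain-Python sketch (not MyHDL) of the complete behavior, following the ALUOp, funct7, and funct3 encodings used in these notes, might look like the following. The function name `alu_control` and its argument names are illustrative.

```python
def alu_control(alu_op, funct7, funct3):
    """Return the 4-bit ALU operation for the seven instructions in our subset."""
    if alu_op == 0b00:          # lw / sw: add to compute the address
        return 0b0010
    if alu_op == 0b01:          # beq: subtract to compare
        return 0b0110
    if alu_op == 0b10:          # R-type: look at funct7 and funct3
        r_type = {
            (0b0000000, 0b000): 0b0010,  # add
            (0b0100000, 0b000): 0b0110,  # sub
            (0b0000000, 0b111): 0b0000,  # and
            (0b0000000, 0b110): 0b0001,  # or
        }
        if (funct7, funct3) in r_type:
            return r_type[(funct7, funct3)]
    raise ValueError("unsupported ALUOp/funct combination")

print(format(alu_control(0b10, 0b0100000, 0b000), "04b"))  # sub
```

Note that add and sub differ only in funct7, which is why funct7 cannot be ignored.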


Software can also generate each bit using logic expressions

#### Datapath with ALU control



## Generating control signals from opcode

| Inst.  | ALUSrc | Memto<br>Reg | Reg<br>Write | Mem<br>Read | Mem<br>Write | Branch | ALU<br>Op |
|--------|--------|--------------|--------------|-------------|--------------|--------|-----------|
| R-type | 0      | 0            | 1            | 0           | 0            | 0      | 10        |
| lw     |        |              |              |             |              |        | 00        |
| sw     |        |              |              |             |              |        | 00        |
| beq    |        |              |              |             |              | 1      | 01        |



## Generating control signals from opcode

| Inst.  | ALUSrc | Memto<br>Reg | Reg<br>Write | Mem<br>Read | Mem<br>Write | Branch | ALU<br>Op |
|--------|--------|--------------|--------------|-------------|--------------|--------|-----------|
| R-type | 0      | 0            | 1            | 0           | 0            | 0      | 10        |
| lw     | 1      | 1            | 1            | 1           | 0            | 0      | 00        |
| sw     | 1      | X            | 0            | 0           | 1            | 0      | 00        |
| beq    | 0      | X            | 0            | 0           | 0            | 1      | 01        |

X means don't care. It can be 0 or 1. Designers can pick a value to optimize the circuit.

Work out the values yourself from the diagram rather than just memorizing them.

## Example: generating control signal from opcode

| Inst.  | Opcode   | ALU<br>Src | Memto<br>Reg | Reg<br>Write | Mem<br>Read | Mem<br>Write | Branch |
|--------|----------|------------|--------------|--------------|-------------|--------------|--------|
| R-type | 011 0011 | 0          | 0            | 1            | 0           | 0            | 0      |
| lw     | 000 0011 | 1          | 1            | 1            | 1           | 0            | 0      |
| sw     | 010 0011 | 1          | X            | 0            | 0           | 1            | 0      |
| beq    | 110 0011 | 0          | X            | 0            | 0           | 0            | 1      |

Op6, Op5, ..., Op0 are bit 6, bit 5, ..., and bit 0 in the opcode.

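Opcode[3:0] is the same (0011) for all four instruction types, so only Op6, Op5, and Op4 need to be checked:

$$RType = \overline{Op6} \cdot Op5 \cdot Op4$$

$$Load = \overline{Op6} \cdot \overline{Op5} \cdot \overline{Op4}$$

$$Store = \overline{Op6} \cdot Op5 \cdot \overline{Op4}$$

$$Branch = Op6 \cdot Op5 \cdot \overline{Op4}$$

Each control signal is then the OR of the instruction types that assert it, for example:

$$ALUSrc = Load + Store$$

$$RegWrite = RType + Load$$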
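The same expressions can be checked in software. The sketch below decodes the four instruction types from Op6, Op5, and Op4 and derives every control signal from them. Only ALUSrc and RegWrite are given explicitly in the slides; the remaining expressions are read off the control table, and the don't-care (X) entries of MemtoReg are set to 0 here as one possible design choice. The function name `main_control` is illustrative.

```python
def main_control(opcode):
    """Generate the main control signals from a 7-bit opcode."""
    op6, op5, op4 = (opcode >> 6) & 1, (opcode >> 5) & 1, (opcode >> 4) & 1
    inv = lambda bit: 1 - bit            # 1-bit NOT

    r_type = inv(op6) & op5 & op4
    load   = inv(op6) & inv(op5) & inv(op4)
    store  = inv(op6) & op5 & inv(op4)
    branch = op6 & op5 & inv(op4)

    return {
        "ALUSrc":   load | store,
        "MemtoReg": load,                # X for sw and beq; 0 chosen here
        "RegWrite": r_type | load,
        "MemRead":  load,
        "MemWrite": store,
        "Branch":   branch,
        "ALUOp":    (r_type << 1) | branch,   # 10 for R-type, 01 for beq, 00 for lw/sw
    }

for name, opcode in [("R-type", 0b0110011), ("lw", 0b0000011),
                     ("sw", 0b0100011), ("beq", 0b1100011)]:
    print(name, main_control(opcode))
```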

## **Datapath with control**




## **Operation of Datapath: R-Type Instruction**

| Inst.  | Opcode   | ALU<br>Src | Memto<br>Reg | Reg<br>Write | Mem<br>Read | Mem<br>Write | Branch |
|--------|----------|------------|--------------|--------------|-------------|--------------|--------|
| R-type | 011 0011 | 0          | 0            | 1            | 0           | 0            | 0      |



## **Operation of Datapath: load**



## **Operation of Datapath: store**



## **Operation of Datapath: beq**

| Inst. | Opcode   | ALU<br>Src | Memto<br>Reg | Reg<br>Write | Mem<br>Read | Mem<br>Write | Branch |
|-------|----------|------------|--------------|--------------|-------------|--------------|--------|
| beq   | 110 0011 | 0          | X            | 0            | 0           | 0            | 1      |

Figure: datapath with control for `beq rs1, rs2, offset`, with ALUOp = 01 and ALU operation = 0110 (subtract). The Branch signal enables the branch.

#### Which instruction will be affected?




## Signal values



## Find the signal values

Control signals

| Inst. | ALUSrc | Memto<br>Reg | Reg<br>Write | Mem<br>Read | Mem<br>Write | Branch | ALU<br>Op |
|-------|--------|--------------|--------------|-------------|--------------|--------|-----------|
| lw    |        |              |              |             |              |        | 00        |

- rs1, rs2, rd =
- immediate =
- branch target address =
- next PC = PC + 4 = 0x0400 8010

Addr: 0x0400 800C, Instr: 0x010A 2503, `lw x10, 16(x20)`

| Name (field size) | 7 bits                                    | 5 bits | 5 bits | 3 bits | 5 bits        | 7 bits | Comments                      |
|-------------------|-------------------------------------------|--------|--------|--------|---------------|--------|-------------------------------|
| R-type            | funct7                                    | rs2    | rs1    | funct3 | rd            | opcode | Arithmetic instruction format |
| I-type            | immediate[11:0] (12 bits)                 |        | rs1    | funct3 | rd            | opcode | Loads & immediate arithmetic  |
| S-type            | immed[11:5]                               | rs2    | rs1    | funct3 | immed[4:0]    | opcode | Stores                        |
| SB-type           | immed[12,10:5]                            | rs2    | rs1    | funct3 | immed[4:1,11] | opcode | Conditional branch format     |
| UJ-type           | immediate[20,10:1,11,19:12] (20 bits)     |        |        |        | rd            | opcode | Unconditional jump format     |
| U-type            | immediate[31:12] (20 bits)                |        |        |        | rd            | opcode | Upper immediate format        |
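For the `lw` example at address 0x0400 800C, the field values can be checked with a few shifts and masks. This is only a sketch for verifying hand-decoded answers; the variable names are illustrative. Note that the rs2 field is still read from bits 24:20 even though, for an I-type instruction, those bits belong to the immediate.

```python
pc, instr = 0x0400800C, 0x010A2503   # lw x10, 16(x20)

opcode = instr & 0x7F                # bits 6:0
rd     = (instr >> 7) & 0x1F         # bits 11:7
funct3 = (instr >> 12) & 0x7         # bits 14:12
rs1    = (instr >> 15) & 0x1F        # bits 19:15
rs2    = (instr >> 20) & 0x1F        # bits 24:20 (immediate bits for lw)

imm = instr >> 20                    # imm[11:0] in bits 31:20
if imm & 0x800:                      # sign-extend from 12 bits
    imm -= 0x1000

next_pc = pc + 4

print(f"opcode={opcode:07b} rs1={rs1} rs2={rs2} rd={rd} funct3={funct3:03b}")
print(f"immediate={imm} next PC=0x{next_pc:08X}")
```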

#### **Support more instructions**

- How would you add more instructions?
  - What existing blocks are used?
  - Which new functional blocks are needed (if any)?
  - What new signals need to be added (if any)?

```
xor
addi
bne
```

Homework

```
jal
jalr
```

#### How fast can this processor run?

- Assume the following delays
  - 80 ps for decoding
  - 100 ps for register read or write
  - 200 ps for ALU
  - 300 ps for memory

- Ignore other delays (e.g., propagation delay of registers and MUXes)

| Instr  | Instr<br>fetch | Register<br>read | ALU   | Memory<br>access | Register<br>write | Total time |
|--------|----------------|------------------|-------|------------------|-------------------|------------|
| lw     | 300ps          | 100ps            | 200ps | 300ps            | 100ps             | 1000ps     |
| sw     | 300ps          | 100ps            | 200ps | 300ps            |                   | 900ps      |
| R-type | 300ps          | 100ps            | 200ps |                  | 100ps             | 700ps      |
| beq    | 300ps          | 100ps            | 200ps |                  |                   | 600ps      |
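The totals can be reproduced by summing the delays of the stages each instruction uses, and the single-cycle clock period is then the largest total. The sketch below uses the delays assumed above; the dictionary names and the stage lists are written out here for illustration.

```python
# Stage delays in picoseconds, as assumed above
DELAY_PS = {"fetch": 300, "reg_read": 100, "alu": 200, "mem": 300, "reg_write": 100}

# Stages on the critical path of each instruction type
STAGES = {
    "lw":     ["fetch", "reg_read", "alu", "mem", "reg_write"],
    "sw":     ["fetch", "reg_read", "alu", "mem"],
    "R-type": ["fetch", "reg_read", "alu", "reg_write"],
    "beq":    ["fetch", "reg_read", "alu"],
}

latency_ps = {name: sum(DELAY_PS[s] for s in stages) for name, stages in STAGES.items()}
cycle_time_ps = max(latency_ps.values())      # every instruction takes one full cycle
max_clock_ghz = 1000 / cycle_time_ps          # 1000 ps corresponds to 1 GHz

for name, total in latency_ps.items():
    print(f"{name}: {total} ps")
print(f"cycle time = {cycle_time_ps} ps, max clock = {max_clock_ghz} GHz")
```

Since every instruction occupies a full cycle, the faster instructions gain nothing from their shorter paths.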

## Why is single-cycle implementation not used today?

Execution time = Instruction Count  $\times$  CPI  $\times$  Cycle Time

Instruction Count is determined by ISA

CPI = 1

Cycle Time is decided by the slowest instruction (which one?)

Make common case fast!

#### **Steps in Instruction Execution**

- Use address in PC to fetch instruction from instruction memory
- Decode the instruction and read the register file (RF)
- Arithmetic or logical operations
  - Use ALU to perform the operation
  - Set the correct signals for updating the destination register
- Load/Store
  - Use ALU to calculate memory address
  - Access data memory for load/store
  - Set the correct signals for updating the destination register, for load
- Branches
  - Use ALU to compare
  - Calculate branch target address, using a separate adder
  - Select proper address for the next instruction
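The steps above can be sketched as a tiny behavioral simulator that executes one instruction per call, mirroring one clock cycle. This is a simplified model under several assumptions: 32-bit registers, word-addressed Python dictionaries standing in for the instruction and data memories, and only the seven instructions of our subset. The names `step` and `sext` and the hand-assembled test program are illustrative.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit registers

def sext(value, bits):
    """Sign-extend the low `bits` bits of value."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

def step(pc, regs, imem, dmem):
    """Run one instruction; update regs/dmem in place and return the next PC."""
    instr = imem[pc]                                    # fetch
    opcode = instr & 0x7F                               # decode
    rd, funct3, funct7 = (instr >> 7) & 0x1F, (instr >> 12) & 0x7, instr >> 25
    a = regs[(instr >> 15) & 0x1F]                      # read rs1
    b = regs[(instr >> 20) & 0x1F]                      # read rs2
    next_pc = pc + 4
    if opcode == 0b0110011:                             # R-type: ALU, then write back
        results = {(0x00, 0): a + b, (0x20, 0): a - b,
                   (0x00, 7): a & b, (0x00, 6): a | b}
        regs[rd] = results[(funct7, funct3)] & MASK32
    elif opcode == 0b0000011:                           # lw: address, read memory, write back
        regs[rd] = dmem.get((a + sext(instr >> 20, 12)) & MASK32, 0)
    elif opcode == 0b0100011:                           # sw: address, write memory
        imm = sext((funct7 << 5) | rd, 12)              # rd field holds imm[4:0]
        dmem[(a + imm) & MASK32] = b
    elif opcode == 0b1100011:                           # beq: compare, select next PC
        imm = sext(((instr >> 31) << 12) | (((instr >> 7) & 1) << 11)
                   | (((instr >> 25) & 0x3F) << 5) | (((instr >> 8) & 0xF) << 1), 13)
        if a == b:                                      # stands in for subtract + Zero
            next_pc = pc + imm
    else:
        raise ValueError("unsupported opcode")
    regs[0] = 0                                         # x0 is hard-wired to zero
    return next_pc

imem = {0x00: 0x00002283,   # lw  x5, 0(x0)
        0x04: 0x00528333,   # add x6, x5, x5
        0x08: 0x00602223,   # sw  x6, 4(x0)
        0x0C: 0x00528463}   # beq x5, x5, 8
regs, dmem, pc = [0] * 32, {0: 21}, 0
for _ in range(4):
    pc = step(pc, regs, imem, dmem)
print(regs[5], regs[6], dmem[4], hex(pc))
```

Unlike the hardware, this model does not expose the individual control signals; it only reproduces the architectural effect of each instruction.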

#### The Main Control Unit

The main Control Unit generates control signals from the opcode. The table can be derived from the diagram.

How would you set these signals for R/I/S/SB type instructions?

| Signal name | Effect when deasserted                                                           | Effect when asserted                                                                                    |
|-------------|----------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|
| RegWrite    | None.                                                                            | The register on the Write register input is written with the value on the Write data input.             |
| ALUSrc      | The second ALU operand comes from the second register file output (Read data 2). | The second ALU operand is the sign-extended, 12 bits of the instruction.                                |
| PCSrc       | The PC is replaced by the output of the adder that computes the value of PC + 4. | The PC is replaced by the output of the adder that computes the branch target.                          |
| MemRead     | None.                                                                            | Data memory contents designated by the address input are put on the Read data output.                   |
| MemWrite    | None.                                                                            | Data memory contents designated by the address input are replaced by the value on the Write data input. |
| MemtoReg    | The value fed to the register Write data input comes from the ALU.               | The value fed to the register Write data input comes from the data memory.                              |

#### **Implementation of ALU Control (Hardware)**

Write the logic expression for each bit in Operation

```
Operation[1] =
  (~ALUOp1 & ~ALUOp0) | ALUOp0 | (ALUOp1 & ~I[14] & ~I[13] & ~I[12])
Operation[2] =
  (ALUOp0) | (ALUOp1 & I[30] & ~I[14] & ~I[13] & ~I[12])
```

| ALUOp1 | ALUOp0 | I[30] | I[14] | I[13] | I[12] | Operation |
|--------|--------|-------|-------|-------|-------|-----------|
| 0      | 0      | X     | X     | X     | X     | 0010      |
| X      | 1      | X     | X     | X     | X     | 0110      |
| 1      | X      | 0     | 0     | 0     | 0     | 0010      |
| 1      | X      | 1     | 0     | 0     | 0     | 0110      |
| 1      | X      | 0     | 1     | 1     | 1     | 0000      |
| 1      | X      | 0     | 1     | 1     | 0     | 0001      |
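The bit-level expressions can be checked against the truth table in software. In the sketch below, Operation[1] and Operation[2] follow the expressions given above. Operation[0] and Operation[3] are not written out in the slides; they are derived here with the same product-term method (only the `or` row sets bit 0, and no row sets bit 3). The function name `alu_operation_bits` is illustrative, and only the ALUOp values that actually occur (00, 01, 10) are checked.

```python
from itertools import product

def alu_operation_bits(alu_op1, alu_op0, i30, i14, i13, i12):
    """Compute the 4-bit ALU operation from single-bit inputs using logic expressions."""
    inv = lambda bit: 1 - bit            # 1-bit NOT
    op3 = 0
    op2 = alu_op0 | (alu_op1 & i30 & inv(i14) & inv(i13) & inv(i12))
    op1 = ((inv(alu_op1) & inv(alu_op0)) | alu_op0
           | (alu_op1 & inv(i14) & inv(i13) & inv(i12)))
    op0 = alu_op1 & inv(i30) & i14 & i13 & inv(i12)
    return (op3 << 3) | (op2 << 2) | (op1 << 1) | op0

# lw/sw and beq: the funct bits are don't-cares, so try all 16 combinations
for bits in product([0, 1], repeat=4):
    assert alu_operation_bits(0, 0, *bits) == 0b0010
    assert alu_operation_bits(0, 1, *bits) == 0b0110

# R-type rows: (I[30], I[14], I[13], I[12]) -> expected operation
r_rows = {(0, 0, 0, 0): 0b0010, (1, 0, 0, 0): 0b0110,
          (0, 1, 1, 1): 0b0000, (0, 1, 1, 0): 0b0001}
for bits, expected in r_rows.items():
    assert alu_operation_bits(1, 0, *bits) == expected
print("logic expressions match the truth table")
```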